Abstract

This article is a first attempt to explore parallel algorithms for learning
distributed representations of both entities and relations in large-scale
knowledge repositories, using the MapReduce programming model on a multi-core
processor.

We accelerate the training of a canonical knowledge embedding method, the
translating embedding (TransE) model, by dividing a whole knowledge repository
into several balanced subsets and feeding each subset into an individual core,
where local embeddings are updated concurrently during the Map phase.

However, this approach usually suffers from inconsistent low-dimensional vector
representations of the same key collected from different Map workers, which
leads to conflicts when Reduce merges the various vectors associated with that
key.

Therefore, we try several strategies for acquiring merged embeddings that not
only retain the performance of the single-thread TransE on entity inference,
relation prediction, and triplet classification, evaluated on several
well-known knowledge bases such as Freebase and NELL, but also scale up the
learning speed with the number of cores within a processor.

So far, the empirical studies show that we can achieve results comparable to
those that the single-thread TransE obtains with the stochastic gradient
descent (SGD) algorithm, and increase the training speed several times by
adapting the batch gradient descent (BGD) algorithm to the MapReduce paradigm.

Keywords: parallel knowledge embedding; MapReduce; stochastic gradient
descent (SGD); batch gradient descent (BGD)


1 Introduction 

The emerging trend of learning distributed representations for both entities and
relationships from knowledge repositories [1], without extra text, has drawn much
attention from academia as well as industry. The learnt embeddings not only
bring significant improvements on several subtasks of knowledge population,
such as entity inference, relation prediction and triplet identification, but also
facilitate other AI applications, such as question answering systems and
automatic summarization systems. Almost all studies of knowledge embedding
leverage existing repositories such as WordNet^[1] [2], Freebase^[2] [3, 4] and
NELL^[3] [5], which contain billions of triplets, each represented by
<head_entity, relation, tail_entity>, abbreviated as <h, r, t>. Furthermore, each
knowledge base is considered as a huge directed graph where nodes are entities, edges
are relations, and a triplet <h, r, t> implies that there is a relationship
r between the entities h and t, pointing from the head entity h to the tail entity
t. This hierarchical structure inspired Bordes et al. [6] to interpret relationships as
translation vectors operating between the embeddings of two entities. They
proposed a canonical model named TransE (translating embedding), which led to
a prosperous line of research on knowledge embedding [7, 8, 9, 10, 11, 12, 13] that
focuses on devising models that are easy to train, contain fewer parameters,
and are expected to scale up to very large knowledge bases.

^[1] https://wordnet.princeton.edu/
^[2] https://www.freebase.com/
^[3] http://rtw.ml.cmu.edu/rtw/

However, to the best of our knowledge, the classical approach TransE [6] and
its successors [7, 8, 9, 10, 11, 12, 13] conventionally reported outstanding
experimental results on relatively small datasets such as WN100K and
FB150K^[4], but few of them mentioned the efficiency of training on really
large-scale datasets. We believe the reason is that all the models, including the
state of the art, only demonstrated training algorithms within the single-thread
programming paradigm, which leaves the potential power of multi-core computing
untapped.

Multi-core processors allow computations to run simultaneously, but they also
change the programming paradigm. An intuitive way of parallel computing is to
divide the dataset and feed each core with one subset on which it performs its
own calculations. Therefore, the key challenges of parallel computing with a
multi-core processor are how to schedule the computation tasks, maintain the
algorithm on each core, and synchronize the shared memory if necessary. To tackle
these issues, Google researchers designed a well-known programming model called
MapReduce [14] for distributed computing, and it has been widely adopted by both
commercial and open-source clusters. MapReduce generally provides two interfaces
for programmers to adapt their single-thread algorithms:

• Map processes the input data and emits a set of intermediate key/value
pairs based on the algorithm devised by programmers.

• Reduce accepts an intermediate key together with the set of values for that
key. It uses a customized function to merge the values, and finally produces
zero or one output for that key.

During execution, an independent master schedules multiple Map and Reduce
workers to run concurrently, which makes parallel computing possible.
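
To make the two interfaces concrete, the following is a minimal in-process sketch
and not the implementation used in this article. It assumes a toy word-count task,
and the names map_fn, reduce_fn and map_reduce are hypothetical; a thread pool
stands in for the Map workers scheduled by the master.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(chunk):
    """Map: emit intermediate (key, value) pairs for one input chunk."""
    return [(word, 1) for word in chunk.split()]


def reduce_fn(key, values):
    """Reduce: merge all values collected for one key into a single output."""
    return key, sum(values)


def map_reduce(chunks, workers=2):
    # The "master": run Map workers concurrently, group by key, then Reduce.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fn, chunks))
    grouped = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = map_reduce(["a b a", "b c"])
```

Here every value for a key can simply be summed, which is why the Reduce step is
trivial for this toy task.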

So far, several quintessential machine learning models have already been adapted
to the MapReduce framework by Chu et al. [15]. However, embedding-based models
such as word embedding [16] and knowledge embedding [1] differ from traditional
learning approaches like logistic regression, because the data itself constructs
the parameter space of embedding-based models. For example, knowledge embedding
models acquire a low-dimensional vector representation of each entity and
relation within a repository, so if we split that knowledge base and distribute
the subsets to multiple Map workers, the parameter space is divided at the same
time. This usually leads to inconsistent distributed representations of the same
key when conducting Reduce on the intermediate key/value pairs emitted by many
Map workers. The problem is still challenging and has not been properly solved yet.
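
The following minimal sketch illustrates why splitting the data also splits the
parameter space. It is an illustration under assumptions: the triplets are made-up
examples, the round-robin split is only one possible way of obtaining balanced
subsets, and split_balanced and keys_per_worker are hypothetical helper names.

```python
from collections import defaultdict


def split_balanced(triplets, n_workers):
    """Round-robin split, so subset sizes differ by at most one."""
    return [triplets[i::n_workers] for i in range(n_workers)]


def keys_per_worker(subsets):
    """Map each entity/relation key to the set of workers that will update it."""
    owners = defaultdict(set)
    for worker, subset in enumerate(subsets):
        for h, r, t in subset:
            for key in (h, r, t):
                owners[key].add(worker)
    return owners


triplets = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("France", "borders", "Germany"),
    ("Paris", "located_in", "Europe"),
]
subsets = split_balanced(triplets, 2)
owners = keys_per_worker(subsets)
# Keys trained by more than one worker end up with several candidate vectors.
conflicting = sorted(key for key, workers in owners.items() if len(workers) > 1)
```

In this toy split, every key listed in conflicting would reach Reduce with two
different locally trained vectors.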

^[4] We changed the original names of the two datasets, wn and fb15k, to
emphasize their sizes.



Fan et al. 



Therefore, this paper is a first attempt to explore parallel algorithms for
learning distributed representations of both entities and relations in large-scale
knowledge repositories, using the MapReduce programming model on a multi-core
processor. We accelerate the training of the canonical knowledge embedding
method, the translating embedding (TransE) model [6], by dividing a whole
knowledge repository into several balanced subsets and feeding each subset into
an individual core, where local embeddings are updated concurrently during the
Map phase.

To deal with the merging problem in the Reduce phase, caused by inconsistent
low-dimensional vector representations of the same key collected from different
Map cores, we contribute several strategies for acquiring embeddings that not
only retain the performance of the single-thread TransE on entity inference,
relation prediction, and triplet classification, evaluated on several well-known
knowledge bases such as Freebase and NELL, but also scale up the learning speed
with the number of cores within a processor. So far, the empirical studies show
that we can achieve results comparable to those that the single-thread TransE
obtains with the stochastic gradient descent (SGD) algorithm, and increase the
training speed several times by adapting the batch gradient descent (BGD)
algorithm to the MapReduce paradigm.

The rest of this article is organized as follows: Section 2 recaps the TransE
model and its single-thread training algorithm based on stochastic gradient
descent (SGD). We then adapt the single-thread algorithm to the MapReduce
framework in Section 3. Specifically, Section 3.1 discusses directly
parallelizing the SGD-based algorithm within the MapReduce paradigm, and
proposes three strategies for reducing the conflicting embeddings. Section 3.2,
from another perspective, focuses on concurrently updating the gradients rather
than the inconsistent embeddings to avoid conflicts, and suggests adapting the
batch gradient descent (BGD) algorithm to the MapReduce paradigm instead. (TO BE
CONTINUED...)

2 Single-thread TransE

Starting from 2011, when Bordes et al. [1] began working on representation
learning of entities and relationships within knowledge bases, the literature on
knowledge embedding grew steadily, but few models could support large-scale
training within acceptable time due to their high complexity. In 2013, Bordes
et al. [6] devised TransE, a canonical model that pursued only one
low-dimensional vector representation for each entity and each relationship. The
reduced number of parameters made it possible to scale up to very large
datasets. Moreover, TransE learnt those distributed representations with a
single-layer energy-based model, which was considerably easier to train.

2.1 Translating Embedding Model 

Inspired by the hierarchical structures of the relationship between two entities shown
in Figure 1(a), Bordes et al. interpreted relationships as translations operating on
entities and assumed that if a triplet <h, r, t> holds, then
$\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. The boldface characters indicate
k-dimensional vector representations ($\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$)
taking values in $\mathbb{R}^k$. The function $d$ in Equation (1) was derived from
that assumption to measure the energy of a triplet <h, r, t> as follows,


$$ d(\mathbf{h}, \mathbf{r}, \mathbf{t}) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert, \tag{1} $$


in which we could take either the $L_1$ or the $L_2$-norm to measure the dissimilarity.
Given a training set $A$ of triplets <h, r, t> composed of the set of entities $E$
($h, t \in E$) and the set of relations $R$ ($r \in R$), we expected to discriminate
the training triplets from corrupted triplets. A corrupted triplet <h', r, t'> from
its set $A'$ was constructed according to Equation (2) by randomly replacing either
the head or the tail entity of a ground-truth triplet from the training set:


$$ A'_{(h,r,t)} = \{(h', r, t) \mid h' \in E \wedge h' \neq h\} \cup \{(h, r, t') \mid t' \in E \wedge t' \neq t\}. \tag{2} $$
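
As a minimal sketch of Equations (1) and (2), the energy and the sampling of one
corrupted triplet could be written as follows. The function names dissimilarity
and corrupt are hypothetical, and choosing between head and tail replacement with
equal probability is an assumption made for this illustration.

```python
import random


def dissimilarity(h, r, t, norm=1):
    """Energy d(h, r, t) = ||h + r - t|| under the L1 (norm=1) or L2 (norm=2) norm."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(x) for x in diff)
    return sum(x * x for x in diff) ** 0.5


def corrupt(triplet, entities, rng=random):
    """Draw one member of the corrupted set: replace the head or the tail entity."""
    h, r, t = triplet
    if rng.random() < 0.5:  # assumption: head and tail are equally likely
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))


perfect = dissimilarity([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])      # h + r equals t
distance = dissimilarity([0.0, 0.0], [3.0, 4.0], [0.0, 0.0], 2)  # L2 norm of (3, 4)
fake = corrupt(("a", "r", "b"), ["a", "b", "c"], random.Random(0))
```

A triplet that satisfies the translation assumption exactly has zero energy, while
a corrupted triplet differs from the original in exactly one entity.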


As we favored lower energy for training triplets than for corrupted ones, the objective
was naturally to minimize a margin-based ranking loss $\mathcal{L}$ over all
the training triplets:

$$ \mathcal{L} = \sum_{(h,r,t) \in A} \sum_{(h',r,t') \in A'_{(h,r,t)}} [\gamma + d(\mathbf{h}, \mathbf{r}, \mathbf{t}) - d(\mathbf{h'}, \mathbf{r}, \mathbf{t'})]_+, \tag{3} $$

where $[x]_+$ was the hinge loss function that only considered the positive part of $x$
($[x]_+ = \max\{0, x\}$), and $\gamma$ was the positive margin ($\gamma > 0$).
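
A small worked example of the margin-based loss may help. The sketch below assumes
the energies of each (training, corrupted) pair have already been computed; the
names hinge and margin_loss and the numbers are illustrative only.

```python
def hinge(x):
    """Positive part of x, i.e. max(0, x)."""
    return max(0.0, x)


def margin_loss(energy_pairs, gamma):
    """Sum of hinge(gamma + d_pos - d_neg) over (d_pos, d_neg) energy pairs."""
    return sum(hinge(gamma + d_pos - d_neg) for d_pos, d_neg in energy_pairs)


# First pair: the corrupted triplet is already further away than the margin,
# so it contributes nothing. Second pair: the margin is violated.
loss = margin_loss([(0.5, 2.0), (1.0, 1.2)], gamma=1.0)
```

Only the second pair contributes here, since 1.0 + 0.5 - 2.0 is negative while
1.0 + 1.0 - 1.2 is positive.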

2.2 SGD-based Single-thread Algorithm


Algorithm 1 The SGD-based Single-thread Learning Algorithm of TransE
Require:
    Training set A = {(h, r, t)}, entity set E, relation set R; dimension of embeddings d,
    margin γ, learning rate α for L, convergence threshold ε, maximum epochs n.

    for r ∈ R do
        r := Uniform(…, …)
        …
    end for
    i := 0
    while Rel.loss > ε and i < n do
        for e ∈ E do
            e := Uniform(…, …)
            …
        end for
        for (h, r, t) ∈ A do
            …
        end for
    end while
Ensure:
    All the embeddings of e and r, where e ∈ E and r ∈ R.
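
Since several steps of Algorithm 1 are not legible here, the following is only a
hedged sketch of a single-thread SGD loop in the spirit of TransE, not a
transcription of Algorithm 1. Assumptions: the initialization bounds of
6/sqrt(dim) and the per-epoch renormalization of entities follow the original
TransE paper [6]; the energy is the squared L2 distance so that the gradient is
simple; "Rel.loss" is read as the relative change of the epoch loss. The names
train_transe, init_embeddings and normalize are hypothetical.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def init_embeddings(keys, dim, rng):
    b = 6.0 / math.sqrt(dim)  # assumption: bounds from the original TransE paper
    return {k: normalize([rng.uniform(-b, b) for _ in range(dim)]) for k in keys}


def train_transe(triplets, entities, relations, dim=8, gamma=1.0, alpha=0.01,
                 eps=1e-4, max_epochs=100, seed=0):
    rng = random.Random(seed)
    ent = init_embeddings(entities, dim, rng)
    rel = init_embeddings(relations, dim, rng)
    prev_loss = None
    for _ in range(max_epochs):
        for e in ent:  # keep entity vectors at unit length
            ent[e] = normalize(ent[e])
        loss = 0.0
        for h, r, t in triplets:
            # Corrupt either the head or the tail with a different entity.
            if rng.random() < 0.5:
                h2, t2 = rng.choice([e for e in entities if e != h]), t
            else:
                h2, t2 = h, rng.choice([e for e in entities if e != t])
            pos = [a + b - c for a, b, c in zip(ent[h], rel[r], ent[t])]
            neg = [a + b - c for a, b, c in zip(ent[h2], rel[r], ent[t2])]
            violation = gamma + sum(x * x for x in pos) - sum(x * x for x in neg)
            if violation <= 0:
                continue  # margin satisfied: zero gradient
            loss += violation
            for i in range(dim):
                gp, gn = 2 * pos[i], 2 * neg[i]  # gradients of squared L2 energies
                ent[h][i] -= alpha * gp
                ent[t][i] += alpha * gp
                rel[r][i] -= alpha * (gp - gn)
                ent[h2][i] += alpha * gn
                ent[t2][i] -= alpha * gn
        # Assumption: stop when the relative change of the loss falls below eps.
        if prev_loss and abs(prev_loss - loss) / prev_loss < eps:
            break
        prev_loss = loss
    return ent, rel


entities = ["a", "b", "c", "d"]
relations = ["r1", "r2"]
triplets = [("a", "r1", "b"), ("c", "r2", "d"), ("a", "r2", "c")]
ent, rel = train_transe(triplets, entities, relations, dim=4, max_epochs=20)
```

Every update writes directly into the shared embedding tables, which is exactly
what becomes problematic once several workers train on different subsets.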


3 MapReduce TransE 

3.1 SGD-based MapReduce Paradigm 

3.1.1 SGD-based Map phase 

3.1.2 SGD-based Reduce phase 

• Random knowledge embedding 
• Average knowledge embedding

• Mini-loss knowledge embedding 
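
The three strategy names above are not yet elaborated in this draft, so the
following sketch only shows one plausible reading of each and should not be taken
as the definitions intended here. It assumes that every Map worker emits, for a
given key, a pair of its local vector and its local training loss; the function
names and the sample values are hypothetical.

```python
import random


def reduce_random(candidates, rng=random):
    """Keep one of the conflicting vectors, chosen uniformly at random."""
    return list(rng.choice(candidates)[0])


def reduce_average(candidates):
    """Element-wise mean of all conflicting vectors."""
    vectors = [vector for vector, _ in candidates]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def reduce_mini_loss(candidates):
    """Keep the vector from the Map worker that reported the smallest local loss."""
    return list(min(candidates, key=lambda c: c[1])[0])


# Hypothetical values emitted for one key by three Map workers: (vector, local loss).
candidates = [([1.0, 0.0], 0.9), ([0.0, 1.0], 0.4), ([0.5, 0.5], 0.7)]
averaged = reduce_average(candidates)
best = reduce_mini_loss(candidates)
picked = reduce_random(candidates, random.Random(1))
```

All three functions take the same input and return a single vector, so they could
be swapped as the customized merge function of the Reduce phase.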

3.2 BGD-based MapReduce Paradigm 

3.2.1 BGD-based Map phase

3.2.2 BGD-based Reduce phase 
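
The idea of exchanging gradients instead of embeddings can be sketched as follows.
This is an assumption-laden illustration rather than the algorithm of this
section: Map workers read a shared copy of the embeddings and emit (key, gradient)
pairs, Reduce sums the gradients per key, and one update is applied afterwards.
The squared L2 energy, the pre-sampled corrupted triplets, and the names bgd_map,
bgd_reduce and apply_update are all hypothetical choices.

```python
def bgd_map(subset, ent, rel, gamma):
    """Map: emit (key, gradient) pairs for one subset; embeddings are only read.

    Each item of `subset` is a (training triplet, corrupted triplet) pair.
    """
    emitted = []
    for (h, r, t), (h2, _, t2) in subset:
        pos = [a + b - c for a, b, c in zip(ent[h], rel[r], ent[t])]
        neg = [a + b - c for a, b, c in zip(ent[h2], rel[r], ent[t2])]
        if gamma + sum(x * x for x in pos) - sum(x * x for x in neg) <= 0:
            continue  # margin satisfied: zero gradient
        gp = [2 * x for x in pos]
        gn = [2 * x for x in neg]
        emitted += [(("e", h), gp), (("e", t), [-x for x in gp]),
                    (("e", h2), [-x for x in gn]), (("e", t2), gn),
                    (("r", r), [p - n for p, n in zip(gp, gn)])]
    return emitted


def bgd_reduce(emitted):
    """Reduce: sum all gradients that share a key; sums never conflict."""
    total = {}
    for key, grad in emitted:
        acc = total.setdefault(key, [0.0] * len(grad))
        for i, g in enumerate(grad):
            acc[i] += g
    return total


def apply_update(ent, rel, total, alpha):
    """Apply one batch step with the summed gradients."""
    for (kind, name), grad in total.items():
        table = ent if kind == "e" else rel
        table[name] = [x - alpha * g for x, g in zip(table[name], grad)]


ent = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}
rel = {"r": [0.0, 0.0]}
subsets = [[(("a", "r", "b"), ("a", "r", "c"))],
           [(("a", "r", "b"), ("c", "r", "b"))]]
emitted = [pair for s in subsets for pair in bgd_map(s, ent, rel, gamma=1.0)]
total = bgd_reduce(emitted)
apply_update(ent, rel, total, alpha=0.1)
```

Because gradients are additive, the Reduce step needs no tie-breaking rule, at the
cost of one synchronization per batch.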

4 Experiment 

5 Discussion 

6 Conclusion 

This article is a first attempt to explore parallel algorithms for learning
distributed representations of both entities and relations in large-scale
knowledge repositories, using the MapReduce programming model on a multi-core
processor.

We accelerate the training of a canonical knowledge embedding method, the
translating embedding (TransE) model, by dividing a whole knowledge repository
into several balanced subsets and feeding each subset into an individual core,
where local embeddings are updated concurrently during the Map phase.

However, this approach usually suffers from inconsistent low-dimensional vector
representations of the same key collected from different Map workers, which
leads to conflicts when Reduce merges the various vectors associated with that
key.

Therefore, we try several strategies for acquiring merged embeddings that not
only retain the performance of the single-thread TransE on entity inference,
relation prediction, and triplet classification, evaluated on several
well-known knowledge bases such as Freebase and NELL, but also scale up the
learning speed with the number of cores within a processor.

So far, the empirical studies show that we can achieve results comparable to
those that the single-thread TransE obtains with the stochastic gradient
descent (SGD) algorithm, and increase the training speed several times by
adapting the batch gradient descent (BGD) algorithm to the MapReduce paradigm.